The University of Southampton

Project: Efficient agent training from Human Preferences and Justifications in safety-critical environments

Key information:

Student Ilias Kazantzidis
Academic Supervisors Tim Norman, Chris Freeman, Yali Du
Cohort  2
Pure Link  Active Project

Abstract: 

Research on building autonomous agents has grown rapidly over the last few years. Agents may be autonomous vehicles, robots, drones, recommender systems, or any entity or program whose aim is to learn to make good and safe decisions. Policies for making such decisions can be learnt successfully with Reinforcement Learning, in which the agent gradually improves its policy by interacting with the environment, receiving positive rewards for good behaviour and negative rewards for bad. In computer simulations, initial bad behaviour does not matter: a blundered queen or a robot arm destroyed during learning can simply be reset. In the real world, however, a broken arm cannot easily be fixed, and any damage, whether to the agent or to the environment, is prohibited. It has been observed that human involvement during such training is decisive for both safety and performance, but training can take a long time, hours or even days. Hence, we search for methods in which a human is involved efficiently during real agent training.
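As a concrete illustration of the learning loop described above, the sketch below shows a minimal tabular Q-learning agent in a toy grid world with a goal cell and water cells. This is only a sketch under assumed settings: the grid layout, reward values and hyperparameters are illustrative, not the project's setup, which learns from human input and a neural network model rather than a hand-written reward.

    import random
    from collections import defaultdict

    # Minimal, illustrative Q-learning loop: the agent tries actions, receives
    # rewards from the environment and gradually improves its policy.
    ACTIONS = ["up", "down", "left", "right"]
    GOAL, WATER = (4, 4), {(1, 2), (3, 1)}   # hypothetical goal (red) and water (blue) cells

    def step(state, action):
        """Move on a 5x5 grid; +1 at the goal, -1 in water, small step cost otherwise."""
        moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
        r, c = state
        dr, dc = moves[action]
        nxt = (min(max(r + dr, 0), 4), min(max(c + dc, 0), 4))
        if nxt == GOAL:
            return nxt, 1.0, True
        if nxt in WATER:
            return nxt, -1.0, True               # in simulation, failure is just a reset
        return nxt, -0.01, False

    q = defaultdict(float)                       # q[(state, action)] -> estimated return
    alpha, gamma, epsilon = 0.1, 0.95, 0.1

    for episode in range(2000):
        state, done = (0, 0), False
        while not done:
            # epsilon-greedy: mostly exploit the current policy, sometimes explore
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # temporal-difference update: nudge the estimate towards reward + future value
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt

Note that in this purely simulated loop the agent is free to fall into the water many times while learning, which is exactly the behaviour that is unacceptable for a real robot and motivates the human involvement discussed below.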

In this project we involve the human through the provably effective technique of 'Human Preference' queries: the agent asks the human preference queries and uses the answers to learn a neural network model of the optimal policy. At the same time, it must gradually 'grow', i.e. start making decisions without consulting the human repeatedly, since the human cannot stay alert forever. On such occasions, we must make sure the agent takes no dangerous action. The novel solution we propose is to augment the Preferences with Justifications, essentially a warning signal that tells the agent when a proposed action is unsafe. In this way the agent builds the optimal policy almost certainly safely. Moreover, such learning is convenient for the human teacher, who only needs to say which of two actions is better and recognise whether either of them is unsafe. The picture shows experiments from the MINDS CDT lab, where a 'non-waterproof' robot (the agent) communicates with a human through a laptop (via keyboard or spoken English) and learns to follow the optimal path to the goal state (red cell) without ever 'drowning' in the water (blue cells). Finally, our algorithm uses a number of techniques to minimise the number of queries to the human and thus completes training as quickly as possible.
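The sketch below illustrates what a preference query augmented with a justification might look like. The names, interface and update rule are illustrative assumptions (the project learns a neural network model from the preferences; a simple score table stands in for it here): the human states which of two candidate actions is better and flags any that are unsafe, and the agent records flagged state-action pairs so that it never executes them.

    from dataclasses import dataclass

    # Hypothetical sketch of a preference-with-justification query; the names and
    # interface below are illustrative assumptions, not the project's actual code.

    @dataclass
    class Feedback:
        preferred: int   # which of the two candidate actions the human prefers (0 or 1)
        unsafe: set      # indices (0 and/or 1) that the human flags as unsafe

    def query_human(state, candidates):
        """Stand-in for the real keyboard / spoken-English interface."""
        print(f"State {state}: is 0) {candidates[0]} or 1) {candidates[1]} better? Any unsafe?")
        preferred = int(input("preferred (0 or 1): "))
        unsafe = {int(tok) for tok in input("unsafe indices (blank for none): ").split()}
        return Feedback(preferred, unsafe)

    def act(state, candidates, scores, blocked, needs_query):
        """Prefer the model's favourite candidate, but never one flagged as unsafe.

        `scores` stands in for the learned preference model; `blocked` holds the
        state-action pairs the human has justified as unsafe.
        """
        if needs_query(state, candidates):
            fb = query_human(state, candidates)
            for i in fb.unsafe:                       # justification: remember unsafe actions
                blocked.add((state, candidates[i]))
            key = (state, candidates[fb.preferred])   # crude preference update
            scores[key] = scores.get(key, 0.0) + 1.0
        safe = [a for a in candidates if (state, a) not in blocked]
        # if both candidates are unsafe, fall back to a known-safe default action
        return max(safe, key=lambda a: scores.get((state, a), 0.0)) if safe else "stay"

    # Example use with an assumed query rule: ask only while the two scores are close.
    scores, blocked = {}, set()
    uncertain = lambda s, c: abs(scores.get((s, c[0]), 0.0) - scores.get((s, c[1]), 0.0)) < 0.5
    action = act((0, 0), ["right", "down"], scores, blocked, uncertain)

As the learned scores become reliable, a rule such as `needs_query` fires less and less often, so the number of queries to the human, and hence the demand on the human's attention, shrinks over training, which is the query-efficiency goal described above.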